Exploring the Factors That Affect White Wine Quality

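This report explores a dataset of 4,898 white wines described by 11 variables that quantify each wine's chemical composition, plus an expert quality rating.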

Univariate Plots Section

## [1] 4898   12
##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"
## 'data.frame':    4898 obs. of  12 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000

- About the dataset

Quality: min 3, mean 5.88, max 9
Alcohol: min 8, mean 10.51, max 14.2
pH: min 2.72, mean 3.19, max 3.82

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

Sulphates are right-skewed, with a mean close to 0.49.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

pH is approximately normally distributed, with a mean of about 3.19.

## Warning: Removed 3 rows containing non-finite values (stat_bin).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

Density is approximately normally distributed with a mean close to 0.994 g/cm^3, although a few wines exceed 1.005 g/cm^3.

## Warning: Removed 6 rows containing non-finite values (stat_bin).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

Total sulfur dioxide is approximately normally distributed with a mean close to 138; a few wines exceed 300.

## Warning: Removed 17 rows containing non-finite values (stat_bin).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

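Free sulfur dioxide is approximately normally distributed with a mean close to 35 mg/dm^3. The middle 50% of the wines fall between 23 and 46, and a considerable share exceed 50. According to the dataset documentation, at free SO2 concentrations above 50 ppm the SO2 becomes evident in the nose and taste of the wine.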

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

About 50% of the wines have a chloride content between 0.036 and 0.05 g/dm^3.

## Warning: Removed 11 rows containing non-finite values (stat_bin).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

Citric acid is roughly normally distributed, with a mean close to 0.33 g/dm^3.

## Warning: Removed 8 rows containing non-finite values (stat_bin).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

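Volatile acidity is right-skewed. The middle 50% of the wines fall between 0.21 and 0.32 g/dm^3, but a few exceed 1 g/dm^3; the dataset documentation notes that too high a level can lead to an unpleasant sour, vinegar-like taste.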

## Warning: Removed 2 rows containing non-finite values (stat_bin).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

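Fixed acidity is approximately normally distributed with a mean close to 6.9 g/dm^3 (median 6.8). The middle 50% of the wines fall between 6.3 and 7.3.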

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

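To better understand the distribution of residual sugar, I applied a log10 transform to the long-tailed data. The transformed distribution is bimodal, with the first peak around 1.5 g/L and the second around 10 g/L. The dataset documentation notes that it is rare to find wines with less than 1 g/L, and that wines with more than 45 g/L are considered sweet.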
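The effect of the log10 transform can be sketched in base R. This is only an illustration: the `sugar` vector below is synthetic (two log-normal clusters chosen to sit near 1.5 and 10 g/L), not the real `residual.sugar` column, and the bin count is an arbitrary choice.

```r
# Synthetic right-skewed values standing in for residual.sugar (g/L);
# these are NOT the real wine measurements.
set.seed(42)
sugar <- c(rlnorm(300, meanlog = log(1.5), sdlog = 0.3),
           rlnorm(200, meanlog = log(10),  sdlog = 0.4))

# On the raw scale the long right tail dominates the histogram.
raw_hist <- hist(sugar, breaks = 30, plot = FALSE)

# After a log10 transform the two clusters show up as separate peaks.
log_hist <- hist(log10(sugar), breaks = 30, plot = FALSE)

# Bin midpoints are on the log10 scale; convert back to g/L for reading.
peak_mid <- log_hist$mids[which.max(log_hist$counts)]
print(10^peak_mid)
```

Dropping `plot = FALSE` draws the histograms; with ggplot2, adding `scale_x_log10()` to a histogram is a common way to get the same view while keeping the axis labelled in g/L.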

## 
##   Bone_Dry        Dry    Off_Dry Semi_sweet     Sweety 
##        170       1927       1975        825          1
## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

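Taste classification

After further reading about wine, I divided the wines into five taste categories by residual sugar (g/L): Bone_Dry (0, 1], Dry (1, 4], Off_Dry (4, 12], Semi_sweet (12, 45], and Sweety (45, 66]. The dataset contains 170 Bone_Dry, 1,927 Dry, 1,975 Off_Dry, and 825 Semi_sweet wines, and only a single Sweety wine.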
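One way to build such a categorical variable is base R's `cut()`. The sketch below uses the interval boundaries and labels listed above; the `residual_sugar` values are hypothetical, picked so that every interval (including its right-hand boundary) is exercised, and the original report may have built the variable differently.

```r
# Hypothetical residual sugar values in g/L, chosen to hit every interval.
residual_sugar <- c(0.8, 1.0, 2.5, 4.0, 7.3, 12.0, 20.7, 45.0, 65.8)

# right = TRUE (the default) makes each interval closed on the right: (a, b].
sugar_taste <- cut(residual_sugar,
                   breaks = c(0, 1, 4, 12, 45, 66),
                   labels = c("Bone_Dry", "Dry", "Off_Dry", "Semi_sweet", "Sweety"),
                   right = TRUE)

# Count the wines in each taste category.
print(table(sugar_taste))
```

Because the intervals are right-closed, a wine with exactly 1.0 g/L lands in Bone_Dry and one with exactly 45.0 g/L in Semi_sweet. Values outside (0, 66] would become `NA`.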

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.130   6.890   7.405   7.467   7.960  14.960

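The dataset documentation describes fixed acidity as 'most acids involved with wine or fixed or nonvolatile'. To keep the analysis simple, I summed the acidity measures into a new variable, total acidity (allacidity), in the hope that it would reveal new patterns.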
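The report does not show how the total was computed. A plausible reading, consistent with the summaries above (the three means 6.855 + 0.2782 + 0.3342 add up to about 7.467, the mean of the new variable), is the sum of fixed acidity, volatile acidity, and citric acid. A minimal sketch under that assumption, using the first three rows from the `str()` output:

```r
# First three rows as shown in the str() output; column names follow the dataset.
whitewine <- data.frame(
  fixed.acidity    = c(7.0, 6.3, 8.1),
  volatile.acidity = c(0.27, 0.30, 0.28),
  citric.acid      = c(0.36, 0.34, 0.40)
)

# Assumed definition: total acidity = fixed + volatile + citric (all in g/dm^3).
whitewine$allacidity <- with(whitewine,
                             fixed.acidity + volatile.acidity + citric.acid)

print(whitewine$allacidity)
```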

Univariate Analysis

What is the structure of your dataset?

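The dataset contains 4,898 white wine samples. The variables describe each wine's composition: alcohol content, pH, various ionic components, residual sugar, density, and so on. A further column holds the experts' rating, a score between 0 (poor) and 10 (excellent). After loading the dataset, I dropped the column named X, because the data frame already has its own index.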

What is/are the main feature(s) of interest in your dataset?

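The main feature of interest is quality, the experts' rating. I want to try to predict a wine's quality from the other variables in the dataset.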

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

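According to several articles on wine, quality is largely determined by sweetness (sugar + alcohol), acidity, and bitterness (phenols, i.e. tannins). I therefore analyzed fixed acidity, volatile acidity, citric acid, total acidity, residual sugar, alcohol, and quality.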

Did you create any new variables from existing variables in the dataset?

I created two new variables: total acidity (allacidity) and taste (sugar.taste).

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

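The raw residual sugar data is right-skewed with a long tail. To understand its distribution, I applied a log10 transform, which revealed a bimodal distribution.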

Bivariate Plots Section

## Warning in ggcorr(whitewine, method = c("all.obs", "spearman"), nbreaks =
## 4, : data in column(s) 'sugar.taste' are not numeric and were ignored

##                      fixed.acidity volatile.acidity  citric.acid
## fixed.acidity           1.00000000      -0.02269729  0.289180698
## volatile.acidity       -0.02269729       1.00000000 -0.149471811
## citric.acid             0.28918070      -0.14947181  1.000000000
## residual.sugar          0.08902070       0.06428606  0.094211624
## chlorides               0.02308564       0.07051157  0.114364448
## free.sulfur.dioxide    -0.04939586      -0.09701194  0.094077221
## total.sulfur.dioxide    0.09106976       0.08926050  0.121130798
## density                 0.26533101       0.02711385  0.149502571
## pH                     -0.42585829      -0.03191537 -0.163748211
## sulphates              -0.01714299      -0.03572815  0.062330940
## alcohol                -0.12088112       0.06771794 -0.075728730
## quality                -0.11366283      -0.19472297 -0.009209091
## allacidity              0.98717874       0.07157062  0.394143356
##                      residual.sugar   chlorides free.sulfur.dioxide
## fixed.acidity            0.08902070  0.02308564       -0.0493958591
## volatile.acidity         0.06428606  0.07051157       -0.0970119393
## citric.acid              0.09421162  0.11436445        0.0940772210
## residual.sugar           1.00000000  0.08868454        0.2990983537
## chlorides                0.08868454  1.00000000        0.1013923521
## free.sulfur.dioxide      0.29909835  0.10139235        1.0000000000
## total.sulfur.dioxide     0.40143931  0.19891030        0.6155009650
## density                  0.83896645  0.25721132        0.2942104109
## pH                      -0.19413345 -0.09043946       -0.0006177961
## sulphates               -0.02666437  0.01676288        0.0592172458
## alcohol                 -0.45063122 -0.36018871       -0.2501039415
## quality                 -0.09757683 -0.20993441        0.0081580671
## allacidity               0.10473749  0.04552987       -0.0451333172
##                      total.sulfur.dioxide     density            pH
## fixed.acidity                 0.091069756  0.26533101 -0.4258582910
## volatile.acidity              0.089260504  0.02711385 -0.0319153683
## citric.acid                   0.121130798  0.14950257 -0.1637482114
## residual.sugar                0.401439311  0.83896645 -0.1941334540
## chlorides                     0.198910300  0.25721132 -0.0904394560
## free.sulfur.dioxide           0.615500965  0.29421041 -0.0006177961
## total.sulfur.dioxide          1.000000000  0.52988132  0.0023209718
## density                       0.529881324  1.00000000 -0.0935914935
## pH                            0.002320972 -0.09359149  1.0000000000
## sulphates                     0.134562367  0.07449315  0.1559514973
## alcohol                      -0.448892102 -0.78013762  0.1214320987
## quality                      -0.174737218 -0.30712331  0.0994272457
## allacidity                    0.113188502  0.27560881 -0.4306513315
##                        sulphates     alcohol      quality  allacidity
## fixed.acidity        -0.01714299 -0.12088112 -0.113662831  0.98717874
## volatile.acidity     -0.03572815  0.06771794 -0.194722969  0.07157062
## citric.acid           0.06233094 -0.07572873 -0.009209091  0.39414336
## residual.sugar       -0.02666437 -0.45063122 -0.097576829  0.10473749
## chlorides             0.01676288 -0.36018871 -0.209934411  0.04552987
## free.sulfur.dioxide   0.05921725 -0.25010394  0.008158067 -0.04513332
## total.sulfur.dioxide  0.13456237 -0.44889210 -0.174737218  0.11318850
## density               0.07449315 -0.78013762 -0.307123313  0.27560881
## pH                    0.15595150  0.12143210  0.099427246 -0.43065133
## sulphates             1.00000000 -0.01743277  0.053677877 -0.01185225
## alcohol              -0.01743277  1.00000000  0.435574715 -0.11751272
## quality               0.05367788  0.43557472  1.000000000 -0.13137721
## allacidity           -0.01185225 -0.11751272 -0.131377207  1.00000000
##                      residual.sugar   chlorides free.sulfur.dioxide
## residual.sugar           1.00000000  0.08868454        0.2990983537
## chlorides                0.08868454  1.00000000        0.1013923521
## free.sulfur.dioxide      0.29909835  0.10139235        1.0000000000
## total.sulfur.dioxide     0.40143931  0.19891030        0.6155009650
## density                  0.83896645  0.25721132        0.2942104109
## pH                      -0.19413345 -0.09043946       -0.0006177961
## sulphates               -0.02666437  0.01676288        0.0592172458
## alcohol                 -0.45063122 -0.36018871       -0.2501039415
## quality                 -0.09757683 -0.20993441        0.0081580671
## allacidity               0.10473749  0.04552987       -0.0451333172
##                      total.sulfur.dioxide     density            pH
## residual.sugar                0.401439311  0.83896645 -0.1941334540
## chlorides                     0.198910300  0.25721132 -0.0904394560
## free.sulfur.dioxide           0.615500965  0.29421041 -0.0006177961
## total.sulfur.dioxide          1.000000000  0.52988132  0.0023209718
## density                       0.529881324  1.00000000 -0.0935914935
## pH                            0.002320972 -0.09359149  1.0000000000
## sulphates                     0.134562367  0.07449315  0.1559514973
## alcohol                      -0.448892102 -0.78013762  0.1214320987
## quality                      -0.174737218 -0.30712331  0.0994272457
## allacidity                    0.113188502  0.27560881 -0.4306513315
##                        sulphates     alcohol      quality  allacidity
## residual.sugar       -0.02666437 -0.45063122 -0.097576829  0.10473749
## chlorides             0.01676288 -0.36018871 -0.209934411  0.04552987
## free.sulfur.dioxide   0.05921725 -0.25010394  0.008158067 -0.04513332
## total.sulfur.dioxide  0.13456237 -0.44889210 -0.174737218  0.11318850
## density               0.07449315 -0.78013762 -0.307123313  0.27560881
## pH                    0.15595150  0.12143210  0.099427246 -0.43065133
## sulphates             1.00000000 -0.01743277  0.053677877 -0.01185225
## alcohol              -0.01743277  1.00000000  0.435574715 -0.11751272
## quality               0.05367788  0.43557472  1.000000000 -0.13137721
## allacidity           -0.01185225 -0.11751272 -0.131377207  1.00000000

Based on the correlation plot, the following pairs looked worth examining, so I plotted each of them for a closer look.

- residual.sugar & density
- chlorides & total.sulfur.dioxide
- alcohol & density
- pH & quality
- allacidity & quality
- residual.sugar & quality
- alcohol & quality
- residual.sugar & alcohol
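Reading the strongest pairs off a large correlation matrix by eye is error-prone, so ranking them programmatically can help. The sketch below does this in base R on a synthetic three-column data frame; `toy`, its columns, and the coefficients used to generate them are all made up for illustration and are not the wine data.

```r
set.seed(3)
n <- 300
# Synthetic columns with built-in relationships (arbitrary coefficients).
sugar   <- runif(n, 1, 20)
alcohol <- 13 - 0.15 * sugar + rnorm(n, sd = 1)
density <- 0.99 + 0.0004 * sugar - 0.0008 * alcohol + rnorm(n, sd = 0.0005)
toy <- data.frame(sugar, alcohol, density)

cm <- cor(toy, method = "pearson")

# Keep each pair once (upper triangle), then sort by absolute correlation.
idx <- which(upper.tri(cm), arr.ind = TRUE)
ranked <- data.frame(var1 = rownames(cm)[idx[, "row"]],
                     var2 = colnames(cm)[idx[, "col"]],
                     r    = cm[idx])
ranked <- ranked[order(-abs(ranked$r)), ]
print(ranked)
```

Sorting by absolute value keeps strong negative relationships, such as the one between alcohol and density in the real data, near the top of the list.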

## 
##  Pearson's product-moment correlation
## 
## data:  whitewine$residual.sugar and whitewine$density
## t = 107.87, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8304732 0.8470698
## sample estimates:
##       cor 
## 0.8389665

The relationship between residual sugar and density is very strong, with a correlation of about 0.84.
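The output above has the shape produced by base R's `cor.test()`. As a rough illustration of how to read it, the sketch below runs the same test on synthetic data; the `sugar` and `density` vectors and the noise level are assumptions chosen only to produce a strong positive correlation.

```r
set.seed(1)
# Synthetic stand-ins: density rises roughly linearly with sugar, plus noise.
sugar   <- runif(200, min = 0.6, max = 20)
density <- 0.990 + 0.0004 * sugar + rnorm(200, sd = 0.001)

result <- cor.test(sugar, density, method = "pearson")

# The estimate is Pearson's r; conf.int is its 95% confidence interval.
print(unname(result$estimate))
print(result$conf.int)

# Degrees of freedom are n - 2, matching "df = 4896" for 4898 wines above.
print(unname(result$parameter))
```

With thousands of observations even a small correlation yields a tiny p-value, so the size of the estimate says more about the strength of a relationship than the p-value does.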

## 
##  Pearson's product-moment correlation
## 
## data:  whitewine$alcohol and whitewine$density
## t = -87.255, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7908646 -0.7689315
## sample estimates:
##        cor 
## -0.7801376

The relationship between alcohol and density is also very strong, at about -0.78.

## 
##  Pearson's product-moment correlation
## 
## data:  whitewine$quality and whitewine$pH
## t = 6.9917, df = 4896, p-value = 3.081e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07162022 0.12707983
## sample estimates:
##        cor 
## 0.09942725

## 
##  Pearson's product-moment correlation
## 
## data:  whitewine$chlorides and whitewine$total.sulfur.dioxide
## t = 14.202, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1718612 0.2256597
## sample estimates:
##       cor 
## 0.1989103

## 
##  Pearson's product-moment correlation
## 
## data:  whitewine$residual.sugar and whitewine$quality
## t = -6.8603, df = 4896, p-value = 7.724e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.12524103 -0.06976101
## sample estimates:
##         cor 
## -0.09757683

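These pairs are only weakly correlated. Quality has almost no linear relationship with residual sugar (r of about -0.10); I suspect that ratings are influenced by the experts' personal taste preferences.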

## 
##  Pearson's product-moment correlation
## 
## data:  whitewine$allacidity and whitewine$quality
## t = -9.273, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1587994 -0.1037525
## sample estimates:
##        cor 
## -0.1313772

The correlation between total acidity and quality is -0.13, and the fitted linear model line for this pair shows almost no trend.

## 
##  Pearson's product-moment correlation
## 
## data:  whitewine$residual.sugar and whitewine$alcohol
## t = -35.321, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4726723 -0.4280267
## sample estimates:
##        cor 
## -0.4506312

The correlation between residual sugar and alcohol is about -0.45.

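In the box plot of alcohol by quality, wines rated 5 have the lowest average alcohol content of all ratings. From rating 6 to 9 the average rises with the rating, but from 3 to 5 the trend is reversed, so the relationship is not monotonic and the overall correlation between the two (r of about 0.44) is only moderate.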
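The per-rating averages behind such a box plot can be computed directly. The sketch below uses a small hypothetical data frame (`wines`, with made-up ratings and alcohol values) to show the idea; it does not reproduce the report's numbers.

```r
# Hypothetical ratings and alcohol values, not taken from the dataset.
wines <- data.frame(
  quality = c(4, 4, 5, 5, 5, 6, 6, 7, 7, 8),
  alcohol = c(10.2, 10.0, 9.5, 9.8, 9.6, 10.4, 10.8, 11.3, 11.5, 12.1)
)

# Mean alcohol per rating: the numbers behind a box plot's visual trend.
mean_by_quality <- tapply(wines$alcohol, wines$quality, mean)
print(round(mean_by_quality, 2))

# The rating with the lowest mean alcohol.
lowest <- names(which.min(mean_by_quality))
print(lowest)

# boxplot(alcohol ~ quality, data = wines) would draw the same comparison.
```

A dip followed by a rise, as in this toy data, is exactly the kind of pattern that a single Pearson coefficient understates.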

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

In the bivariate analysis, density turned out to be closely related to both alcohol and residual sugar.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

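However, I found no single variable with a strong relationship to quality; making a high-quality wine is evidently not that simple. The most interesting relationships involve density: looking across the correlations between features, density is involved in almost all of the largest values.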

What was the strongest relationship you found?

The correlation between density and residual sugar is 0.84, and the correlation between density and alcohol is -0.78.

Multivariate Plots Section

## 
##  Pearson's product-moment correlation
## 
## data:  whitewine$residual.sugar and whitewine$density
## t = 107.87, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8304732 0.8470698
## sample estimates:
##       cor 
## 0.8389665

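Within every quality level, density and residual sugar are closely correlated, and the fitted linear model line shifts to the left as quality increases.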


This plot clearly shows how the taste categories partition the wines by residual sugar.

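For higher-quality wines, the density and alcohol values tend to sit toward the upper left of the plot, while for lower-quality wines they sit lower on the left. In other words, the overall alcohol and density cloud shifts toward the upper left as quality rises, although this does not necessarily mean that better wines always have a high alcohol content.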


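As sweetness increases, the fitted lm line shifts to the right, but the sweeter wines appear to have somewhat lower alcohol and density values than the dry wines.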

- Exploring which other factors show a visible relationship with quality

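As pH rises, total acidity falls (r of about -0.43). My working explanation was that dissolved free SO2 forms an acid and so helps keep the wine acidic (pH < 7), whereas acetate, the conjugate base of a weak acid, would on its own make a solution basic (pH > 7), and likewise for citrate. This is only a rough sketch of the chemistry: acetic acid and citric acid are themselves weak acids, and every wine in the dataset is acidic (pH between 2.72 and 3.82). The dataset documentation describes volatile acidity as the amount of acetic acid in the wine, which at too high a level can lead to an unpleasant, vinegar-like taste; the sourness referred to there is a taste sensation, not the wine's pH.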

- The plots above suggest that there is no simple recipe for a high-quality wine. The analysis continues below.

## 
## Calls:
## m1: lm(formula = quality ~ allacidity, data = whitewine)
## m2: lm(formula = quality ~ allacidity + alcohol, data = whitewine)
## m3: lm(formula = quality ~ allacidity + alcohol + log10(residual.sugar), 
##     data = whitewine)
## 
## ===================================================================
##                               m1            m2            m3       
## -------------------------------------------------------------------
##   (Intercept)                6.856***      3.260***      2.730***  
##                             (0.106)       (0.145)       (0.155)    
##   allacidity                -0.131***     -0.081***     -0.086***  
##                             (0.014)       (0.013)       (0.013)    
##   alcohol                                  0.307***      0.343***  
##                                           (0.009)       (0.010)    
##   log10(residual.sugar)                                  0.288***  
##                                                         (0.031)    
## -------------------------------------------------------------------
##   R-squared                  0.017         0.196         0.211     
##   adj. R-squared             0.017         0.196         0.210     
##   sigma                      0.878         0.794         0.787     
##   F                         85.989       597.586       435.156     
##   p                          0.000         0.000         0.000     
##   Log-likelihood         -6311.978     -5819.603     -5775.542     
##   Deviance                3774.694      3087.211      3032.164     
##   AIC                    12629.956     11647.206     11561.084     
##   BIC                    12649.446     11673.192     11593.567     
##   N                       4898          4898          4898         
## ===================================================================
## 
## Call:
## lm(formula = quality ~ allacidity + alcohol + log10(residual.sugar), 
##     data = whitewine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5799 -0.5287 -0.0104  0.4770  3.2541 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            2.729844   0.154573  17.661  < 2e-16 ***
## allacidity            -0.086298   0.012768  -6.759 1.55e-11 ***
## alcohol                0.343059   0.009984  34.361  < 2e-16 ***
## log10(residual.sugar)  0.288359   0.030592   9.426  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7871 on 4894 degrees of freedom
## Multiple R-squared:  0.2106, Adjusted R-squared:  0.2101 
## F-statistic: 435.2 on 3 and 4894 DF,  p-value: < 2.2e-16

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

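In the bivariate analysis, I found that residual sugar and alcohol are each strongly correlated with density, and the multivariate plots show that this holds within every quality level. The linear model built on total acidity, alcohol, and log10(residual sugar) has an R-squared of 0.21, meaning that it explains about 21% of the variance in quality.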

Were there any interesting or surprising interactions between features?

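All of the wines are acidic (pH < 7), which I attributed in part to dissolved free SO2, yet the sourness perceived on the palate is not directly tied to the pH value.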

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

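To predict wine quality, I built a linear model relating quality to alcohol, sugar, and total acidity. The model is not very accurate, however: every coefficient in the m3 column is marked with three stars (p < 0.001), but the R-squared is only 0.21.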
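The step from m1 to m3 is a comparison of nested linear models. The sketch below mirrors that structure in base R on synthetic data; the data frame `toy`, the generating coefficients, and the noise level are all assumptions, so the printed numbers will not match the table above.

```r
set.seed(7)
n <- 500
# Synthetic predictors and response; coefficients are arbitrary.
allacidity <- rnorm(n, mean = 7.5, sd = 0.9)
alcohol    <- rnorm(n, mean = 10.5, sd = 1.2)
sugar      <- rlnorm(n, meanlog = 1.5, sdlog = 0.8)
quality    <- 3 - 0.1 * allacidity + 0.3 * alcohol +
              0.3 * log10(sugar) + rnorm(n, sd = 0.8)
toy <- data.frame(quality, allacidity, alcohol, sugar)

# Each model adds one term to the previous one.
m1 <- lm(quality ~ allacidity, data = toy)
m2 <- update(m1, . ~ . + alcohol)
m3 <- update(m2, . ~ . + log10(sugar))

# R-squared never decreases as predictors are added to a nested model.
r2 <- sapply(list(m1 = m1, m2 = m2, m3 = m3),
             function(m) summary(m)$r.squared)
print(round(r2, 3))

# F-tests of whether each added term improves the fit.
print(anova(m1, m2, m3))
```

Significance stars and R-squared answer different questions: the stars say a coefficient is unlikely to be zero, while R-squared says how much of the variation in the response the model accounts for. With a large sample, a model can have highly significant coefficients and still predict poorly.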

Final Plots and Summary

Plot One

## 
##   Bone_Dry        Dry    Off_Dry Semi_sweet     Sweety 
##        170       1927       1975        825          1

Description One

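After a log10 transform, the long-tailed residual sugar data produces a bimodal histogram. Based on sugar content, I divided the wines into five taste categories: Bone_Dry, Dry, Off_Dry, Semi_sweet, and Sweety. One possible explanation for the counts is that this consumer group prefers the Dry and Off_Dry styles. Only a single Sweety wine appears; it may be a liqueur-style wine, and in any case it counts as an outlier.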

Plot Two

## 
##  Pearson's product-moment correlation
## 
## data:  whitewine$residual.sugar and whitewine$density
## t = 107.87, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8304732 0.8470698
## sample estimates:
##       cor 
## 0.8389665

Description Two

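Residual sugar is the variable most strongly related to density. As residual sugar increases, density increases too, in a nearly linear relationship with a correlation of about 0.84.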

Plot Three

## 
## Call:
## lm(formula = quality ~ allacidity + alcohol + log10(residual.sugar), 
##     data = whitewine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5799 -0.5287 -0.0104  0.4770  3.2541 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            2.729844   0.154573  17.661  < 2e-16 ***
## allacidity            -0.086298   0.012768  -6.759 1.55e-11 ***
## alcohol                0.343059   0.009984  34.361  < 2e-16 ***
## log10(residual.sugar)  0.288359   0.030592   9.426  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7871 on 4894 degrees of freedom
## Multiple R-squared:  0.2106, Adjusted R-squared:  0.2101 
## F-statistic: 435.2 on 3 and 4894 DF,  p-value: < 2.2e-16

Description Three

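I fitted a linear model of wine quality on total acidity, alcohol, and log10(residual sugar) to explore how they relate, but the R-squared is only 0.21. The coefficients suggest that higher-quality wines tend to have lower acidity and higher alcohol and residual sugar content.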

Reflection

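First, I performed an exploratory data analysis on this dataset of 4,898 white wines to understand its characteristics. I then consulted background material and learned that many factors determine a wine's quality; in chemical terms, sweetness (sugar + alcohol), acidity, and bitterness (phenols, i.e. tannins) all play a part. I also classified the wines into taste categories by residual sugar. I analyzed fixed acidity, volatile acidity, citric acid, total acidity, residual sugar, alcohol, and quality, along with the correlations between them. Because I had not studied in enough depth how the individual acids affect quality, I simply summed the acidity measures and built a model to predict the rating. The result was not very satisfactory: the model explains only about 21% of the variance in quality. I wonder whether citric acid, despite its low concentration, has a larger influence on quality through taste.

Some articles point out that ratings are also influenced by factors such as brand and region of origin. This dataset covers only white wines and contains no information on brand, origin, or similar attributes, so this exploration has significant limitations. In future work, I would try to obtain brand and origin information for these wines and analyze those new variables to improve the model's explanatory power.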